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We demonstrate that the recently proposed pruned-enriched Rosenbluth method PERM 
(P. Grassberger, Phys. Rev. E 56 (1997) 3682) leads to very efficient algorithms for the 
folding of simple model proteins. We test it on several models for lattice heteropolymers, 
and compare to published Monte Carlo studies of the properties of particular sequences. 
In all cases our method is faster than the previous ones, and in several cases we find new 
minimal energy states. In addition to producing more reliable candidates for ground 
states, our method gives detailed information about the thermal spectrum and, thus, 
allows to analyze static aspects of the folding behavior of arbitrary sequences. 



1 Introduction 

Protein folding0@i0 is one of the most interesting and challenging problems in poly- 
mer physics and mathematical biology. It is concerned with the problem of how a 
heteropolymer of a given sequence of amino acids folds into precisely that geor 
rical shape in which it performs its biological function as a molecular machinel! 
Currently, it is much simpler to find coding DNA — and, thus, also amino acid — 
sequences than to elucidate the 3-g? structures of given proteins. Therefore, solving 
the protein folding problem would be a major break-through in understanding the 
biochemistry of the cell and, furthermore, in designing artificial proteins. 

In this contribution we are concerned with the direct approach: given a sequence 
of amino acids, a molecular potential, and no other information, find the ground 
state and the equilibrium state at physiological temperatures. Note that we are 
not concerned with the kinetics of folding, but only in the final outcome.—pAlso, 
we will not address the problems of how to find good molecular potentialsPEnj and 
what is the proper level of detail in describing proteinsul Instead, we will use simple 
coarse-grained models which have been proposed in the literature and have become 
standards in testing the efficiency of folding algorithms. 

Lots of methods have been proposed to solve this problem, ranging, from simple 
Metropolis Monte Carlo simulations at some nonzero temperature E2I over multi- 
canonical simulatiomapproachesliil to stochastic. optimization schemes based, e.g., on 
simulated-aimealingpl and genetic algorithms i3t3 Alternative methodsjise heuristic 
principles^ information from databases of known protein structures^! sometimes 
in combination with known physico-chemical properties of small peptides. 

The algorithms we apply here are variants of the pruned-enriched Rosenbluth 
method (PERM) U. This is a chain growth approach based on the Rosenbluth- 
Rosenbluth (RR)lia method. We will provide details on the algorithm and on the 
analyses that can be performed, and we will present more detailed results on ground 



state and spectral properties, and on the folding behavior of the sequences analyzed. 

2 The Models 

The models we study in this contribution are heteropolymers that live on 2- and 
3-dimensional regular lattices. They are self-avoiding chains with attractive or 
repulsive interactions between neighboring non-bonded monomers. 

The majority of authors considered only two kinds of monomers. Although 
also different interpretations are possible for such a binary choice, e.g. in terms of 
positive and negative electric charges^ the most important model of this class is 
the HP modelBEl There, the two monomer types are assumed to be hydrophobic 
(H) and polar (P), with energies ehh = — 1j ^hp = epp = for interaction between 
not covalently bound neighbors. Since this parameter set leads to highly degenerate 
ground states, alternative parameters were proposed, e.g. e = (—3, —1, — 3)E3 and 
e — (—1, 0, — Note, however, that in these latter parameter sets, since they are 
symmetric upon exchange of H and P, the intuitive distinction between hydrophilic 
and polar monomers gets lost. 

In the other extremal case of models, all monomers of a sequence are consid- 
ered to be different, and interaction energies are drawn randomly from a continuous 
distributional These models correspond, effectively, to assuming an infinite num- 
ber of monomer types. 

3 The Algorithm 

The algorithms we apply here are variants of the pruned-enriched Rosenbluth method 
(PERM)P a chain growth algorithm based on the Rosenbluth-Rosenbluth (RR)t3 
method. Monomers are added sequentially, the n-th monomer being placed at site 
i with probability p n (i). In simple sampling, p n (i) is uniform on all neighbors of 
the last monomer, leading to exponential attrition. The original RR method avoids 
this by using a uniform p n (i) on all vacant neighbors of i n -i- More generally, we 
call any non- uniform choice of p n (i) a generalized RR method. The relative ther- 
mal weight of a particular chain conformation of length n is then determined by 
W„ = m„ exp(— (3AE n )Wn-i, with W\ — 1; AE n is the energy gain from adding 
monomer n; and m„ is the Rosenbluth factor, m n = J2j e{nn} Pn(j) /Pn(i) ■ We note 
that W n is also an estimate for the partition function Z„ of the n-monomcr chainS 
Chain growth is stopped when the final size is reached and started new from n = 1. 

In easy cases, p n {i) can be chosen so that Boltzmann and Rosenbluth factors — 
or Rosenbluth factors for different n — cancel, leading to narrow weight distributions. 
But in general, this algorithm produces a wide spread in weights that can lead to 
serious problemso On the other hand, since the weights accumulate as the chain 
grows, one can interfere during the growth process by 'pruning' conformations with 
low weights and enriching high- weight conformati ons T his is, in principle, similar to 
population based methods in polymer simulationsEj'Ej and in quantum Monte Carlo 
(MC)o However, our implementation is different. Pruning is done stochastically: if 
the weight of a conformation has decreased below a threshold W£ , it is eliminated 
with probability 1/2, while it is kept and its weight is doubled in the other half of 
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cases. Enrichment E3 is done independently of this: if W n increases above another 
threshold , the conformation is replaced by n c copies, each with weight W n /n c . 
Technically, this is done by putting onto a stack all information about conformations 
whichjstill have to be copied. This is most easily implemented by recursive firnction 
callsEO Thereby the need for keeping large populations of conformations E3t3o is 
avoided. PERM_has proven extremely efficient for studies of lattice homopolymers 
near the fi_point[LZl their phase equilibriapl and of the ordering transition in semi-stiff 
polymers E3 

The main freedom when applying PERM consists in the a priori choice of the 
sites where to place the next monomer, in the thresholds W < and W > for pruning 
and copying, and in the number of copies made each time. All these features do 
not affect the formal correctness of the algorithm, but they can greatly influence 
its efficiency. They may depend arbitrarily on chain lengths and on local configura- 
tions, and they can be changed freely at any time during the simulation. Thus the 
algorithm can 'learn' during the simulation. 

In order to apply-EERM to heteropolymers at very low temperatures, the strate- 
gies proposed in RefO are modified as follows. 

(1) For homopolymers near the theta-point it had been found that the best 
choice for the placement of monomers. was not according to their Boltzmann weights, 
but uniformly on all allowed sitesED'Eil This might be surprising since the Boltzmann 
factor has then to be included into the weight of the configuration, which might lead 
to large fluctuations. Obviously, this effect is counterbalanced by the fact that larger 
Boltzmann factors correspond to higher densities and thus to smaller Rosenbluth 
factorsN 

For a heteropolymer this has to be modified, as there is no longer a unique 
relationship between density and Boltzmann factor. In a strategy of 'anticipated 
importance sampling' we should preferentially place monomers on sites with mostly 
attractive neighbors. Assume that we have two kinds of monomers, and we want to 
place a type- A monomer. If an allowed site has raj neighbors of type B (B — H,P), 
we select this site with a probability a 1 + aAH^H + dApirip. Here, clab & rc 
constants with clab > for €ab < and vice versa. 

(2) Most naturally, W > and W < are chosen proportional to the estimated 
partition sum Z n \J\ This becomes inefficient at very low T since Z n will be underes- 
timated as long as no low-energy state is found. When this finally happens, W > is 
too small. Thus too many copies are made which are all correlated but cost much 
CPU time. 

This problem-can be avoided by increasing W and W K during particularly 
successful 'tours' £3 (a tour is the set of configurations derived from a single start). 
But then also the average number of long chains is decreased in comparison with 
short ones. To reduce this effect and to create a bias towards a sample which is flat 
in chain length, we multiply by some power of M n /Mi, where M n is the number of 
generated chains of length n. With J\f(n) denoting the number of chains generated 
during the current tour we used therefore 

W< = C Z n [(1 + Af(n)/M)(M n + M)/(M 1 + M)} 2 , 

and W > = rW < . Here, C is a constant of order unity, r « 10, and M is a constant 
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of order 10 4 - 10 5 . 

d3) Creating only one additional copy at each enrichment event (as done in 
RefO), cannot prevent the weights from exploding at very low T. Thus we have to 
make several copies if the weight is large and surpasses W > substantially. A good 
choice for the number of new copies created when W > W > is int[l + y/W/W > ]. 

(4) Two special tricks were employed for 'compact' configurations of the 2-d HP 
model filling a square. First of all, since we know in this case where the boundary 
should be, we added a bias for polar monomers to actually be on that boundary, 
by adding an additional energy of —1 per boundary site. Note that this bias has 
to be corrected in the weights, thus the final distributions are unaffected by it and 
unbiased. Secondly, in two dimensions we can immediately delete chains which cut 
the free domain into two disjoint parts, since they never can grow to full length. 
In the present simulations, we checked for this by looking ahead one time step. In 
spite of the additional work this was very efficient, since it reduced considerably the 
time spent on dead-end configurations. 

(5) In some cases we did not start to grow the chain from one end but from 
a point in the middle. We grew first one half, and then the other. Results were 
averaged over all possible starting points. The idea behind this is that real proteins 
have folding nucleilp and it should be most efficient to start from such a nucleus. 
In some cases this trick was very successful and speeded up the ground state search 
substantially, in others not. We take this observation as an indication that in various 
sequences the end groups already provide effective nucleation sites. This is e.g. the 
case for the 80-mer with modified HP interactions of RefEl We also tried to grow 
the chain on both sides (Simultaneously. However it turned out that this is not 



(6) For an effective sampling of low-lying states the choice of simulation tem- 
perature T appears to be of importance. If it is too large, low-lying states will have 
a low statistical weight and will not be sampled reliably. On the other hand, if T 
is too low, the algorithm becomes too greedy: configurations which look good at 
first sight but lead to dead ends are sampled too often, while low energy configu- 
rations, whose qualities become apparent only at late stages of the chain assembly, 
are sampled rarely. Of course they then get huge weights (since the algorithm is 
correct after all), but statistical fluctuations become huge as well. This is in com- 
plete analogy to the slow relaxation hampering more traditional (Metropolis type) 
simulations at low T — note, however, that "relaxation" in the proper sense does not 
exist in the present algorithm. 

In the cases we considered it turned out to be most effective to choose a tem- 
perature that is below the collapse transition temperature (note, however, that this 
transition is smeared out, see the results below) but somewhat above the temper- 
ature corresponding to the structural transition which leads to the native state. 
This observation corresponds qualitatively to the considerations of RefEa, although 
a quantitative comparison appears not to be possible. In some cases it also helped 
to reduce the 'greediness' of the algorithm by not making any pruning/enrichment 
during the first steps. 
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4 The Sequences 



For, the, above models various sequences were analyzed in the literature, and in 
RefEjO we took these analyses as a test for our algorithm. Here we will take a 
closer look at the properties of some of the sequences that were considered there. 

4.1 2d HP model 

Two-dimensional HP chains were used in several papers as test cases for folding 
algorithms. Two chains with N = 100 were studied in RefEi The authors claimed 
that their native configurations were compact, fitting exactly into a f x 10 square, 
and had energies —44 and —46, see Table 1 for the sequences and Fig. [j]and[| for the 
respective proposed ground state structures. These conformations were found by a 
specially designed MC algorithm which should be particularly efficient for compact 
configurations. 

4.2 3d HP model 

Ten sequences of length N — 48 were given in Ref0. Each of these sequences was 
designed by minimizing the energy of a particular target conformation in sequence 
space under the constraint of constant compositions The authors tried to find the 



Table 1: Newly found lowest energy states for binary sequences with interactions e = 
(tHH,£HP,£pp)- Configurations are encoded as sequences of r(ight), /(eft), «(p), d(own), 
/(orward), and ft(ackward). 



N , d 


sequence 


old £ mi „ Rct - 


e 


configuration 


new E min 


100 , 2 


P 6 HPH2P 5 H 3 PH 5 PH 2 P2(P2H2)2PH 5 PHi - 




(-1,0,0) 


PH 2 PH 7 P ll H 7 P2HPH 3 P e ,HPH 2 


-440 




r§ur2U%rdsluldl2drd2ru2rz(rulu)2urdrd,2ruzlurzdld2 — 






rur^ dzkuldl2dzru2rzdil2UTul 


-47 




rdldldrd2 Ti dzhdrdldr2 dfodfe (urul ) 2 urur2 ul2U2l2drd2 — 






lul2uru2r2U4rulzdrdzl2d2ldlu2TU2luzrd2rdr 


-47 




Uzrzur5dl4drd2r2ulur4dr2dld2lu2luld2rdl2ulzdr2dr2d— 






rurdrzd2luld2lu2l2drd2luluzl2ul2Uzru 


-47 




rdrzdldlzdrdrur2ur2ulur2urd2(ldrd)2l2U2ldl2dr2drzdl2d— 






luluhuhdz Idrz U2rdrdldldr2 urur^uz rul 


-47 




rdzrdldl2uldzld2rurzdl2drgulzU2lzw>'zur2dld2rur2ulu2 — 






l2Uzluzr2dr y r2d2rdru2lu2r2U2ldl2dl 


-47 




rzuzru2ruzl2ur2ul2urul4dr2dldrdld2 rur2dld2luld2rdlz~ 






uru2ldldzlzururu2rur 2 ululdldl2Uzld4r 


-47 


100 , 2 


Pz H 2 P2 HiP2Hz{PH 2 )z H 2 P S Ha P 2 H 6 P 9 HPHiPH 11 - 




(-1,0,0) 


P 2 H 3 PH 2 PHP 2 HPH 3 P 6 H 3 


-460 




ru2ldlu2ld2lu2lurulur 2 d2 ru^Tidldiruzrdrdldsrdruzruliiis — 






rdr 2 ululu2{rd)zruT2dldld4 z lulu2ru 


-49 




uzrdru2rd2ru2r2U2ldluldlu5ldel2d2luzr2U6l2dzldr2dl2 — 






dldlu2VU2ldl2drdl2d2rurdrdzruzru 


-49 




ul2drdl2Uzldnldrdl2U2l2dzl2W>'Uzr2Uzrdzru4rul$dldr2 — 






d2luldldrdldluzlul2ulur2dr2Uzrd4l 


-94SEI 


80 , 3 


PH2Pz{HzP 2 HzPzH 2 Pz)zH i P i {HzP2HzPzH2Pz)H2 




(-1,0,-1) 


Ibruflbl2br2drur2dldlzulfrdrzurfldlzulururzdrblulz — 






brzblzdldrdrzurul2dlu 


-98 
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Figure 1: Putative compact native structure Figure 2: Putative compact native structure 
of sequenne-il from Table 1 (E = —45) accord- of sequencp-p from Table I [E = —46) accord- 
ing to RefE3; (filled circle) H monomers, (open ing to RefE3. 
circle) P monomers. 

lowest energy states with two different methods, one being an heuristic-stochastic 
approachEj the other based on exact enumeration of low energy statesEJ With the 
first method they failed in all but one case to reach the lowest energy. With the 
second method in all but one cases they obtained conformations with energies that 
were even lower than the putative ground states the sequences were designed for, 
while for one case the ground state energy was confirmed. Precise CPU times were 
not quoted. 

4-3 3d modified HP model 

A most interesting case is a 2-species 80-mer with interactions (—1,0,-1) studied 
first in RefS. These particular interactions were chosen instead of the HP choice 
(—1,0,0) because it was hoped that this would lead to compact configurations. 
Indeed, the sequence was specially designed to form a "four helix bundle" which 
fits perfectly into a 4 x 4 x 5 box, see Fig. M Its energy in this putative native 
state is E = —94. Although O'Toole et alc3 used highly optimized codes, they 
were not able to recover this state by MC. Instead, they reached only E = —91. 
Supposedly, a different state with E = —94 was found in RefL3, but figure 10 of 
this paper, which is claimed to show this configuration, has a much higher value of 
E. Configurations with E =^=94 but slightly different from that in Refra and with 
E = — 95 were found in Refu3 by means of an algorithm similar to that in RefE3. 
For each of these low energy states the author needed about one week of CPU time 
on a Pentium. 

4-4 3d, infinitly many monomer types 

Sequences with N = 27 and with continuous interactions were studied by Klimov et 
aZc3. Interaction strengths were sampled from Gaussians with fixed non-zero mean 
and fixed variance. These N(N — l)/2 numbers were first attributed randomly to 
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Figure 3: Putative native state of the "four helix bundle" sequence, see Table 1, as proposed by 
O'Toole et al. It has E = —94, fits into a rectangular box, and consists of three homogeneous 
layers. Structurally, it can be interpreted as four helix bundles. 

the monomer pairs, then they were randomly permuted, using a Metropolis ac- 
cept/reject strategy with a suitable cost function, to obtain good folders. Such 
"breeding" strategies to obtain gand, folders were also developed and employed by 
other authors for various models pjKra and seem necessary to eliminate sequences 
which fold too slowly and/or unreliably. It is believed that also during biologi- 
cal evolution optimization processes took place with similar effects, so that actual 
proteins are better folders than random amino sequences. 

5 Results 

Let us now discuss our results. All CPU times quote below refer to SPARC Ultra 
machines with 167 MHz. 

5.1 2d HP model 

For the two HP chains of RefEl with N = 100, see Table 1, we found several 
compact states (within ca. 40 hours of CPU time) that hadLfinergies lower than those 
of the compact putative ground states proposed in RefO. Figures |J and [| show 
representative compact structures with E = —46 for sequence 1 and E = —47 for 
sequence 2. Moreover, we found several non-compact configurations with energies 
even lower: the lowest energies found within 1-2 days of CPU time had E — —47 
and E = —48 for sequence 1 and 2, respectively. Forbidding non-bonded HP pairs, 
we obtained even E = —49 for sequence 2. Figures || and [?] show representative non- 
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the r«ative" state proposed by Ramakrishnan 




Figure 6: One of the (non-compact) lowest Figure 7: One of the (non-compact) lowest 
energy sequences for sequence 1 (E = —47). energy sequences for sequence 2 (E = —49). 

compact structures with these energies; a non-exhaustive collection of these is listed 
in Table 1. These results reflect the well-known property that HP sequences (and 
those of other models) usually have ground states that are not maximally compact, 
see, e.g. Yue et a&B, although there is a persistent prejudice to the contraryn3l^3'C3 

5.2 3d HP model 

With PERM we succeeded to reach ground states of the ten sequences of length 
TV = 48 given in Refo in all cases, in CPU times between a few minutes and one 
day, see Table 2. In these simulations we used a rather simple version of PERM, 
where we started assembly always from the same end of the chain. We found that 
the sequences most difficult to fold were also those which had resisted previous 
Monte Carlo attemptsua In those cases where a ground state was hit more than 
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Table 2: PERM performance for the binary sequences from Re: 



sequence 


rp a 


—E M c b 


77 C 
' ''success 


CPU time 


nr. 








(min) 


1 


32 


31 


101 


6.9 


2 


34 


32 


16 


40.5 


3 


34 


31 


5 


100.2 


4 


33 


30 


5 


284.0 


5 


32 


30 


19 


74.7 


6 


32 


30 


24 


59.2 


7 


32 


31 


16 


144.7 


8 


31 


31 


11 


26.6 


9 


34 


31 


1 


1420.0 


10 


33 


33 


16 


18.3 



Ground state energies 
b Previously reached energies with Monte Carlo methodsta 
c Number of independent tours in which a ground state was hit. 




once, we verified also that the ground states were highly degenerate. In no case 
there were gaps between ground and first excited states, see Fig. ||. Therefore, none 
of these sequences is a good folder, though they were designed specifically for this 
purpose. 

5.3 3d modified HP model 

For the two-species 80-mer with interactions (— 1, 0, — 1), even without much tuning 
our algorithm gave E = —94 after a few hours, but it did not stop there. After 
a number of rather disordered configurations with successively lower energies, the 
final candidate for the native state has E = —98. It again has a highly symmetric 
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Figure 9: Conformation of the "four helix bundle" sequence with E = —98. We propose that this 
is the actual ground state. Its shape is highly symmetric although it does not fit into a rectangular 
box. It is not degenerate except for a flipping of the central front 2x2x2 box. 

shape, although it does not fit into a 4 x 4 x 5 box, see Fig. ||. It has twofold 
degeneracy (the central 2x2x2 box in the front of Fig. |^ can be flipped), and 
both configurations were actually found in the simulations. Optimal parameters for 
the ground state search in this model are (3 = 1/kT 2.0, app = anti ~ 2, and 
ap,p ~ —0.13. With these, average times for finding E = —94 and E = —98 in new 
tours are ca. 20 min and 80 hours, respectively. 

A surprising result is that the monomers are arranged in four homogeneous 
layers in Fig.J|, while they had formed only three layers in the putative ground 
state of Fig. ^Since the interaction should favor the segregation of different type 
monomers, one might have guessed that a configuration with a smaller number of 
layers should be favored. We see that this is outweighed by the fact that both 
monomer types can form large double layers in the new configuration. Again, our 
new ground state is not 'compact' in the sense of minimizing the surface, and hence 
it also disagrees with the wide spread prejudice that native states are compact. 

In terms of secondary structure, the new ground state is fundamentally differ- 
ent from the putative ground state of Refuel. While the new structure (Fig. ^|) is 
dominated by (3 sheets, which can most clearly be seen in the contact matrix (see 
Fig. [l(]) , the structure in Fig. ||| is dominated by helices, see also the corresponding 
contact matrix in Fig. [ll| 

In order to analyze the folding transition of this sequence we again constructed 
histograms of the distribution of energy, end to end distance, and radius of gyration, 
by combining the results obtained at various temperatures between T = 0.45 and 3. 
Figure [l2| shows these distributions, reweighted so that it corresponds to T — 0.75. 
The thermal behavior of these order parameters as functions of T is obtained by 
Laplace transform, and is shown in Fig. O. The behavior of energy, end-to-end dis- 



10 



10 20 30 40 50 60 70 80 

Figure 10: Contact matrix of the structure in 
Fig. Br a black point at (i, j) indicates that there 
is a contact between monomer i and monomer 
j; grey points indicate contacts in only one 
of the two native states, corresponding to the 
twofold degeneracy of the central 2x2x2 box. 
Note that the lines orthogonal to the main di- 
agonal correspond to anti-parallel /3-shpqt sec- 
ondary structure elements, see e.g. Refc3. 
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- Energy 

Figure 12: Histograms of (top) thermal 
weight, (middle) radius of gyration, R^, 
and (bottom) end-to-end distance, R^. e , 
vs energy E for the 80-mer "four helix 
bundle" at T = 0.75. 
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Figure 11: For comparison, the contact-*na- 
trix of the putative ground state of RefO in 
Fig. H; note that point triples close to the di- 
agonal parallel as well as orthogonal to it are 
signatures of 3d helicalrspcondary structure 
elements, see e.g. Refc3; the other points 
denote tertiary contacts between helices. 
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Figure 13: (top) Average end-to-end distance, 
Rg e , and radius of gyration, Rq, (middle) spe- 
cific heat per monomer, C v , and average energy 
per monomer, < E >, vs temperature T for the 
80-mer "four helix bundle" . 
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Figure 14: Thermally averaged contact matrix 
for the 80-mer "four helix bundle" in the col- 
lapsed but unstructured phase (T = 0.85). 



Figure 15: Thermally averaged contact matrix 
for the 80-mer "four helix bundle" in the in- 
termediate helix-dominated phase (T = 0.5). 




Figure 16: Thermally averaged contact matrix 
for the 80-mer "four helix bundle" at the tran- 
sition from the intermediate helix-dominated 
to the /3-sheet dominated phase (T = 0.4). 




Figure 17: Thermally averaged contact matrix 
for the 80-mer "four helix bundle" in the (3- 
sheet dominated phase (T = 0.3). 



tance and radius of gyration follow closely each other and exhibit only the smeared 
out collapse of the chain from a random coil to some unstructured compact phase. 
In contrast, the specific heat exhibits more structure: the shoulder around T = 1 
corresponds again to the coil-globule collapse, but there are additional transitions 
seen around T = 0.62 and T = 0.45. The last one is the transition to the /3-sheet 
dominated native phase. However, the transition at T = 0.62 is from a unstruc- 
tured globule to an intermediate phase that is helix-dominated but exhibits strong 
tertiary fluctuations. These structural transitions are illustrated in Figs 14 to [l7] 
where the thermally averaged contact matrices are shown for the respective phases. 
The intermediate, helix-dominated phase is particularly interesting. To it apply 
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some of the usual characteristics of a molten globule statc3: i) compactness, ii) 
large secondary structure content (although not necessarily native), and iii) strong 
fluctuations. 



5-4 3d, infinitly many monomer types 

For all sequences with N — 27 from RefS we could reach the supposed ground state 
energies within < 1 hour. In no case we found energies lower than those quotedjii 
Refc3, and we verified also the energies of low-lying excited states given in Ref — . 
Notice that these sequences were designed to be good folders by Klimov et alc^. 
This time the design had obviously been successful, which is mainly due to the fact 
that the number of different monomer types is large. All sequences showed some 
gaps between the ground state and the bulk of low-lying states, although these gaps 
are not very pronounced in some cases. 

More conspicuous than these gaps was another feature: all low lying excited 
states were very similar to the ground state, as measured by the fraction of contacts 
which existed also in the native configuration. Stated differently, if the gaps were 
not immediately obvious, this was because they were filled by configurations which 
were very similar to the ground state and can therefore easily transform into the 
native state and back. Such states therefore cannot prevent a sequence from being a 
good folder. For none of the sequences of RefEj we found truely misfolded low-lying 
states with small overlap with the ground state. 

Figure [l^ illustrates this feature for one particular sequence. There we show 
the overlap Q, defined as the fraction by non-bonded nearest-neighbor ground state 
contacts which exist also in the excited state, against the excitation energy. For this 
and for each of the following figures, the 500 lowest-lying states were determined. 
We see no low energy state with a small value of Q. To demonstrate that this is 
due to design, and is not a property of random sequences with the same potential 
distribution, we show in Fig. [l9] the analogous distribution for a random sequence. 

To demonstrate that this difference is not merely due to a statistical fluctuation, 
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Figure 18: Overlap with ground state, Q, vs 
energy, E, of the lowest energy conformations 
for sequence no. 70 of Ref£j. 
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Figure 19: Overlap with ground state, Q, vs 
energy, E, of the lowest energy conformations 
for a single random sequence. 
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Figure 20: Overlap with ground state, Q, vs 
energy, E, of the lowest energy conformations 
for sequences no. 61 to 70 of Ref£j; for bet- 
ter visibility, the same symbol is used for all 
sequences. 
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Figure 21: Overlap with ground state, Q, vs 
energy, E, of the lowest energy conformations 
for ten random sequences; for better visibil- 
ity, the same symbol is used for all sequences. 



we show in Fig. [20|the distributions for ten sequences from Rcf B collected in a single 
plot. Since the ground state energies differ considerably for different sequences, we 
used normalized excitation energies (E — Eq)/Eq on the x-axis. Analogous results 
for ten random sequences are shown in Fig. ^ll While there is no obvious correlation 
between Q and excitation energies for the random case, all low energy states with 
small Q have been eliminated in the designed sequences. 

This elimination of truely misfolded low energy states without elimination of 
native-like low en£rgy states might be an unphysical property of the design proce- 
dure used in RefH, but we do not believe that this is the case. Rather, it should 
be a general feature of any design procedure, including the one due to biological 
evolution. It contradicts the claim of Refo that it is only the gap between native 
and first excited state which determines foldicity. On the other hand, out, results 
are consistent with the "funnel" scenario for the protein folding processpl where 
the folding pathway consists of states successively lower in energy and closer to the 
native state. 

We note that for random sequences there are also excited states that have unit 
overlap with the native state, a feature not present in the folding sequences. These 
are cases where the native state has open loops and/or dangling ends, so that 
more compact conformations have all contacts of the native state, but have — in 
addition — energetically unfavorable contacts resulting in a higher total energy. 



6 Summary and Outlook 

We showed that the pruned-enriched Rosenbluth method (PERM) can be very 
effectively applied to protein structure prediction in simple lattice models. It is 
suited for calculating statistical properties and is very successful in finding native 
states. In all cases it did better than any previous MC method, and in several cases 
it found lower energy states than those which had previously been conjectured to 
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be native. 

We verified that ground states of the HP model are highly degenerate and 
have no gap, leading to bad folders. For sequences that are good folders we have 
established a funnel structure in state space: low-lying excited states of well-folding 
sequences have strong similarities to the ground state, while this is not true for 
non-folders with otherwise similar properties. 

Especially we have presented a new candidate for the native configuration of 
a "four helix bundle" sequence which had been studied before by several authors. 
The ground state structure of the "four helix bundle" sequence, being actually 
/3-sheet dominated, differs strongly from the helix-dominated intermediate phase. 
This sequence, therefore, should not be a good folder. 

Although we have considered only lattice models in this paper, we should stress 
that this is not an inherent limitation of PERM. Straightforward extensions to off- 
lattice system^ are possible and are efficient for homopolymers at relatively high 
temperatuipesEZI Recently performed simulations of a 2-dimensional off-lattice pro- 
in modeled have shown that PERM is not quite as fast as the simulated tempering 
algorithm used by Irback et alci. But there are several ways to improve PERM. 
One way could be to adopt a strategy similar to simulated tempering. Another 
improvement of PERM which could be particularly useful for off-lattice simulations 
might consist in more sophisticated algorithms for positioning the monomers when 
assembling the chain. In particular one might try to use some feedback of suc- 
cessfully build up low energy states or subchains for the positioning of previous 
monomers in the following tours. Finally, one can use configurations reached by 
PERM as a starting point for alternative (Metropolis or greedy) searches. Work 
along these lines is in progress, and we hope to report on it soon. 
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